Accessing open data: the aims of this tutorial

There is now a wealth of data available online. These data come from a variety of sources (crowdsourced data, online transaction data, administrative data, and so on), and in many formats (CSV, XML, or JSON files, or through application programming interfaces (APIs)). You will often want to access such data, and use it for your own work or research.

This tutorial will show you how to access data from the web using QGIS. We will write some Python scripts (don’t worry, you will only have to copy and paste) and also use some plugins available in QGIS.

We will perform some web scraping, extracting data from websites by taking advantage of their HTML tags, and we will also collect tweets made available through the Twitter API.

This tutorial is meant to give you a flavour of how you can access data from online resources, and import it directly into QGIS. The code will be here for your reference, and this tutorial will be available online here. If you have any questions later, do not hesitate to get in touch through email.

The overall aim is to get you familiar with these techniques, and to get you thinking about the possibilities for collecting data to aid new insight into the topic areas in which your main research or professional interests lie. To achieve this, we will carry out two tasks. Firstly, we will try to visualise the distribution of reports of environmental issues made using the online problem-reporting website Fix My Street. Is everyone reporting in London? Is there a difference between England, Wales, and Scotland? These are questions that we would quickly be able to begin to consider if we were to see this data mapped. The first task will guide you through the process of scraping data from a table on the web, and getting it into QGIS in a format which you can map.

The second task will see us acquiring tweets about a topic of interest. The example I use is tweets about the “night tube”. London’s Underground network recently introduced a 24-hour service on weekends; however, this was not without its own controversies. The launch was delayed, and was preceded by several tube strikes, much to commuters’ inconvenience. Many Londoners took to Twitter to express their concerns, and so Twitter offered an excellent source of data to gain insight into the reasons behind people’s upset. This tutorial illustrates how to sign up for a Twitter developer account, and use the API to retrieve tweets on a particular topic.

By following along with these tutorials it is hoped that you gain some hands-on experience with these new forms of data, and begin on a path of learning and ‘hacking’ your way to diverse insight and new perspectives about the topics on which you focus.

Tools

Throughout this tutorial we will be using two main tools, QGIS and Python. I will introduce them both briefly here, giving enough context for this tutorial.

QGIS

The main tool we will be using is QGIS. QGIS is geographic information system (GIS) software, allowing users to analyse and edit spatial information, in addition to composing and exporting graphical maps. Throughout this tutorial I assume that you have some experience using QGIS (or a similar GIS) and that you are familiar with spatial data handling and analysis. If you are interested, you can learn more about QGIS here:

Plugins

QGIS has a variety of plugins that you can download and use for your work. Plugins add useful features to the software. They are written by QGIS developers and other independent users who want to extend its core functionality, and are made available in QGIS for all users.

You can see a tutorial for installing and using plugins here

We will be using two plugins in this tutorial. The first is the Python Console. This should already be available to you when you click on Plugins > Python Console, so there isn’t anything further you need to do to use it, and we will return to it in a bit.

The second plugin we’ll be using is called twitter2qgis, and it is for getting data from Twitter into QGIS. This is actually an experimental plugin. The plugins that are available to you for installation depend on which plugin repositories you are configured to use. QGIS plugins are stored online in repositories. By default, only the official repositories are active, meaning that you can only access official plugins. These are usually the first plugins you want, because they have been tested thoroughly and are often included in QGIS by default. It is possible, however, to try out more plugins than the default ones. To see experimental plugins, open the Settings tab in the Plugin Manager dialog:

Select the option to display Experimental Plugins by selecting the Show also experimental plugins checkbox.

To install twitter2qgis, now click on the All tab, and in the search bar type “twitter2qgis”. You will see the plugin appear. Select and install it using Install plugin. When finished, you will see it under the Web tab in QGIS.

Python

We will also be making use of some Python code within the QGIS environment. Python is the main scripting language for QGIS, and the plugins you might already be using will have been written in it. It’s possible to write your own plugins, or to write scripts which you can execute from within the QGIS environment. Here I will be showing you the code directly for one exercise (web scraping) so that you can get a sense of what such a script is doing, and how you can edit it to fit your needs.

For those unfamiliar with it, Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. Python can be easy to pick up whether you’re a first time programmer or you’re experienced with other languages. For those interested here are some ways to get started:

Setting up Python

If you use a cluster PC, you can skip this step. However, if you want to follow along on your own laptop, you will have to install and set up Python.

Mac users

Python comes with OS X, so you probably don’t need to do a separate install. You can check this by typing python --version into Terminal. Apple’s Terminal app is a direct interface to OS X’s bash shell, part of its UNIX underpinnings. When you open it, Terminal presents you with a white text screen, logged in with your OS X user account by default. You can type your commands in there. If you don’t already use Terminal, I recommend reading up on how it could be useful for you here. But that’s outside of the scope for now, so it’s enough if you know how to open it up, type (or copy and paste) the code python --version, and hit Enter.

If you get an error message, you need to install Python. If Terminal prints something like “Python 2.7.3” (where the exact numbers you see may be different), you’re all set to move on to the next section.

For those who received the error message, you will need to follow the steps in this tutorial to get Python installed.

PC users

PC users will hopefully have had a tutorial prepared by Oscar, and in any case there are many tutorials available online for installing Python:

Writing Python

You will also have heard about IDEs in some of these tutorials. An integrated development environment (IDE) is a software application that provides comprehensive facilities to computer programmers for software development. This is normally the environment where programmers write their code. You can of course use anything that can edit text (I prefer Sublime Text as a basic text editor), or something more sophisticated, for example Eclipse, which contains a collection of tools, including tools for debugging, GUI builders, and tools for modelling, testing, and more.

You can write such code in many such dedicated development environments, however for our purposes here, QGIS provides a built-in console where you can execute python code. This console is a quick way to learn scripting and do quick data processing.

You can open the Python Console by going to Plugins > Python Console:

This will open a little window, where you can paste the Python code from this tutorial, or write your own. While no programming experience is required (or taught really) here, I will describe each bit of code that we use in detail, so that you have an understanding of what you are doing, why, and how you can change this if you wanted to implement it in your own work.

Writing your first bit of code

So to demystify this process for anyone who might not have written any code before, let’s carry out a quick exercise.

We will start with writing something super simple. I won’t go into great detail here, but one of the basic units of writing any code is a variable. Put most simply, a variable is something that varies. You can call this variable anything (eg: height) and give it any value (163cm for example). But then, you can refer to this variable throughout your workflow, and it will hold the value which you give it.

In this first instance, we will create a variable that contains some text. In Python you assign a value to a variable with the = sign.

So here we will create a variable called hello and give it the value “hello world :)” because we’re happy people. You can do this with the following code:

hello = "hello world :)"

Now you have assigned the value hello world :) to the variable called hello. Try pasting (or directly writing) this into the Python Console:

Now this variable, called hello, is in your environment.

Try what happens if you just type its name (hello) into the console, to call it. You can now do anything to the value in this variable, by referring to the variable name.

So for example, if you wanted to print the same thing but all in upper case, you can now apply the upper() function to this variable.

So you would type:

hello.upper()

Congratulations you have just written your first bit of code! Woohoo!


Setting up

One of the most important things to do before you start working is to create an organised workspace. It is important when you are downloading data from the web that you know where these data get saved. And when you are reading in data to QGIS, again it’s good practice to point QGIS in the right direction to read those data from.

Setting up a working directory

So firstly, before we begin to do any work, we should create our working directory. This is simply a folder where you will save all your data, and also where you will be reading data in from. You can create a new folder where you will save everything for this project, or you can choose an existing folder. It’s advised that you create a folder, and give it a name you will remember, that will be meaningful. Generally try to avoid spaces and special characters in folder (and file) names. It’s not necessarily a good idea to just dump everything onto ‘Desktop’ either, as you want to be able to find things later, and maybe keep things tidy.

Anyway, once you have a folder identified, you will need to know the path to this folder. That is the route you will use in your code to read/write files from/to the right directory. You will often get errors about certain things being “not found” due to incorrect file paths, so it’s important that we find the correct path. There are multiple ways of finding it on both Macs and PCs (and Linux, but if you are a Linux user then I will assume you already know all this…!). I will give an example for a Mac and one for a PC here.

On a Mac you can find the path to a specific file or folder by first opening Terminal, then opening Finder and navigating to the folder or file. Once you have found it, just drag and drop the folder or file into the Terminal window. This will print out the path to your file or folder:

On a PC, you can find a path to a file or folder by navigating to it using Windows Explorer and, once there, copying the file path from the top bar. This is illustrated by the red circle below:

NOTE: When you copy this file path on a PC, you will have to change the direction of the slashes when passing it as a text string to a variable in Python. That is, you will have to replace every backslash (\) with a forward slash (/).

For example: C:\Users\mesike\Desktop\dokumentumok should become C:/Users/mesike/Desktop/dokumentumok
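If you prefer, you can also do this swap in Python itself rather than by hand. Here is a quick sketch using the example path from above (note that inside a Python string a literal backslash is written as a doubled \\):

```python
# A Windows path copied from Explorer (the doubled backslashes are how
# you write a literal backslash inside a Python string)
win_path = "C:\\Users\\mesike\\Desktop\\dokumentumok"

# Replace every backslash with a forward slash
clean_path = win_path.replace("\\", "/")
print(clean_path)  # C:/Users/mesike/Desktop/dokumentumok
```
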

Once you have this path, we will use this to create a new variable, that will contain the text value of your path. To do this, you are essentially repeating what we did with the hello world exercise.

You can see below I have provided you a template, where I create a variable called my_path using the filepath that I retrieved:

my_path = "/Users/reka/Desktop/openDataTut"

You can copy the above into a text editor, and then within there replace the path with the one that is relevant to you. Make sure it’s all inside the quotation marks (“”) and that there is no space on either end.

NOTE: Be careful with your choice of text editor though, because some text editors mess about with quotation marks. If you are using a text editor that is not created for writing code, or somehow uses non-UTF-8 characters, then you might encounter an error when you execute this code, telling you about an illegal character. If this happens, just delete and re-type the quotation marks within the QGIS python console. I would suggest you use a text editor meant for writing code, like Sublime Text mentioned earlier, or Notepad++.

Anyway, create this variable called my_path by replacing my filepath above with yours. Then run this in the QGIS python console, to create the variable. You can check that you have succeeded, and the value the variable currently holds, the same way we did in the hello world exercise. Just like we created the hello object to hold your text saying “hello world :)”, we have now created this my_path object, to hold the path to your folder, where you will be saving everything.

So now you have this my_path variable, and you can use it to set the working directory for this session. You can do this by copying and pasting the below code into the QGIS python console:


#import the os so you can make changes
import os 

#set path to the my_path variable we created earlier
path = my_path

#now change the directory (chdir() function) to this new path
os.chdir(path)

What you are doing here, is making sure that this folder is where we will be saving all the data, and also if we tell QGIS to read in some files, it will look for them in this folder as well.
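As a quick sanity check, you can also ask Python where it currently “is”. A small sketch using the same os module:

```python
import os

# getcwd() ("get current working directory") returns the folder Python
# is currently pointed at - after os.chdir() it should match my_path
current = os.getcwd()
print(current)
```
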

Importing some modules

Python is great and versatile because it has many modules that people have created, which enable you to easily do certain operations. Basically, modules are bits of code that someone has written to perform something specific. For example, there is a module for graphing, there is a module for webscraping, and so on. The nice thing about these is that someone has written these useful functions, and we can just import the module and run them, without having to write the complicated stuff ourselves. Further, because Python is open source and anyone can contribute, there are many modules that correspond to people’s many interests. You could write your own if you wanted to!

You will have already installed the modules, or if you are using the PCs provided in the computer lab, then everything should be pre-installed for you.

While modules only need to be installed once (the first time you use them), to use them within your code you have to import them into the section of code where you will be using them. This loads the code from the module into your environment, allowing you to use all the handy functions. To import a module, you can either import everything with the command import module_name, or you can import just certain parts with from module_name import thing_i_want.
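As a sketch of the two forms, using the standard library csv and os modules just for illustration:

```python
# Import the whole module: its contents are then used with a prefix
import csv

# Import just one name from a module: it is then used without a prefix
from os import getcwd

print(getcwd())    # the function we imported by name, no prefix needed
print(csv.writer)  # a function accessed through the module prefix
```
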

In this tutorial we use code from four modules (actually, those paying close attention will have noticed that we use five, but one, os, we already accessed when setting the working directory…).

We use the beautiful soup module, written specifically for webscraping. We also use csv so that we can save our tables as csv, and we also use requests so that we can make requests to a URL (essentially so we can load a page, and get its html back for scraping). Finally, we also use tweepy, for getting tweets!

Installing modules

If you are working from the computer cluster PCs then you are all set; all these modules will already be downloaded for you. However, if you are working on your own computer, you will have to download them yourself. As Oscar has shared, this link has comprehensive guidelines on the installation of modules. However, I’ve curated some friendly step-by-step guides if that helps at all:

Installing modules on a PC:

This tutorial is a simple video to follow along for installing modules using easy_install on a PC.

Installing modules on a mac:

Personally I use Homebrew on the Mac. It’s super handy. Here is a video on installing Homebrew, and here is a guide to using Homebrew to install Python 3.

Homebrew also gets you easy_install (which you will recognise from the PC tutorial), so you can use that to first get pip, with this code in Terminal:

easy_install pip

And then use pip to install modules with commands such as:

pip install tweepy

Loading modules

To load up all these modules, you will have to copy and paste the below code:


#The beautiful soup module is what we will be using for scraping data from webpages
from bs4 import BeautifulSoup

#You will also need the csv module to save csv
import csv

#And you will need the requests to get data from an url
import requests

Now that these modules are imported, we are ready to use them in our next section, where we will carry out some webscraping, to get some data into QGIS from the web.

Webscraping

Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. (Wikipedia)

In the lecture I described scraping data from Fix My Street. That exercise was a larger scale project that took quite some time. So we will not replicate that here. Instead, we will carry out a small-scale project here, to give you an idea (and the skills) for doing this yourself, perhaps on larger scales.

Looking ‘behind the scenes’

As mentioned above, webscraping using beautiful soup makes use of the html tags on a page to collect information into a tabular format that you can then use for analysis. To understand how you can use this, it is important to see the structure of a page.

If you have some experience writing html then you might be at an advantage here. But even if you have not seen anything like this before, do not worry, it is easier than you think!

So first we have to identify a page that has some data we’re interested in. Let’s stick with fixmystreet.com. So let’s have a look at the data they have here, about the number of reports in each local authority. You can see this in a table here.

So this is interesting, some councils have way more reports than others, and we might want to use this to make some maps. But in order to do so, we need to get this data into our QGIS environment. The approach we will follow here is to make use of the code behind the webpage, to systematically scrape that data. To do this, we have to tell our computer to look at the page, and grab the relevant information for us.

So what does this look like to a computer? Well you can have a look at this by right clicking anywhere on the page, and selecting View page source.

This should open up a new tab showing the same exact page that you were viewing, but as the html behind it. This is what the page we were just viewing looks like to your computer. Fun, no?

You can see keywords within angle brackets (<>). These are called tags. So for example, <h1> is a tag for a heading. So if I wanted to get all the headings from a page, I would need my code to return everything between an <h1> and a </h1> tag. You can learn more about HTML tags if you want. Some good tutorials can be found here and here.
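To preview the idea, here is a minimal sketch of pulling out all the headings, using the beautiful soup module we will meet properly later. The HTML snippet is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny made-up page, just for illustration
demo_page = "<html><body><h1>First heading</h1><p>Some text.</p><h1>Second heading</h1></body></html>"

demo_soup = BeautifulSoup(demo_page, "html.parser")

# find_all('h1') returns every <h1>...</h1> element on the page
for heading in demo_soup.find_all('h1'):
    print(heading.get_text())
```
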

Anyway, for the purposes of web scraping, tags are helpful because we can use them to point our bit of code at these specific tags, to retrieve the appropriate data. In the case of getting the data from this table, we will be looking for the <table> tag. So there are a few steps that we will follow here.

First, we need to actually get this background code from the web and into our environment, so we can parse it. Then we parse the code, which is basically making sense of it. This is where the Python module beautifulsoup is very handy, as someone has written code to make this a lot easier for us. Then we can save the parsed data table, and quickly load it into our QGIS environment. So let’s go through these steps now, one by one.

Getting the html

So first things first, we want to make a connection and download the page contents, so we can use the beautifulsoup module to go through it and extract the appropriate data from the specified fields.

The first step is to specify the web address, where we should get this data from. We can do this by creating a variable, url, with the website address as a value. We will do this in the same way that we created our hello and my_path variables. We will create a new variable (url) and assign (=) the value of the web address ("https://www.fixmystreet.com/reports").

Putting it all together:

url = "https://www.fixmystreet.com/reports"

So as we have done before, write this into the Python console in QGIS.

Excellent. Now we can use some code from the requests module (loaded earlier) in order to get all the html content of this page, which we saw earlier in our browser using ‘view source’, into a variable called code. This is essentially the same as how we created the earlier variables (hello, my_path, url), but now, on the other side of the equation, we are actually reading the data in from the web, rather than having to type it all out ourselves. Neat!

response = requests.get(url)
code = response.content

So now we have this variable, code in our environment.

You can now again do things to this variable, the same way you could with the hello variable. For example, if you copy and paste the code below, it should print the first 50 characters of the variable:

print(code[0:50])

If you look back at the source of the page we viewed in the browser earlier, then you will see that it’s basically the same text! There might be some differences, which will be due to encoding issues. You can find out more about encoding here and here.
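If you are curious what this means in miniature: the raw page content arrives as bytes, and decoding it (almost always as UTF-8 for modern pages) turns those bytes into readable text. A small sketch, with an invented byte string for illustration:

```python
# Bytes holding UTF-8 encoded text (the \xc3\xa9 pair encodes "é")
raw = b"<p>caf\xc3\xa9</p>"

# Decoding with the right encoding turns the bytes into readable text;
# decoding with the wrong one is where garbled characters come from
text = raw.decode("utf-8")
print(text)  # <p>café</p>
```
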

Anyway, now that we have all the data in this code variable, we can extract the data we want from it by making use of the tags. This is the process of parsing the html stored in the code variable. We use code from the beautifulsoup module here to make this easier. Specifically, we turn our html into a beautifulsoup object using the BeautifulSoup() function. This gives the code a nested data structure.

soup = BeautifulSoup(code, "html.parser")

You can again have a look at what this new object looks like using the print function:

print(soup.prettify())

The above code will print out the whole page’s html.

Now we don’t want all this though, do we? We just want to extract the data from the table. So here we can use the .find() function, to search for the table tag, and extract the data from there. You can do this with the code below:

table = soup.find('table', attrs={"class" : "nicetable"})

You can see that we pass to this function the name of the tag we want it to find ('table'), and also this attrs={} argument, where we pass something that can be used to identify which table or tables we want to download. This is something that you will have to change and adapt to the tables you might want to download for your own research. So how do you find this? It will vary by the page you are searching. However, here we can do something simple, like search for a keyword from the table in the html of the page. So if you go back to viewing the source for the page, you can use your browser’s search function to find the table you want to download. In this case, let’s search for the first entry in the table, which is Aberdeen City Council:

So you can see where this entry is. But what we are really looking for is something within the <table> tag, that could help us identify this table.

So here, we see a few things included within this table tag. We can choose something from here to select it by, for example the class of this table. You have to be careful: if we only want the contents of this table, then we must make sure that there is not also another table which fits this description. So you want something that identifies this table, and this table alone. In this case we only have this one table here, so we are safe and happy! Yay!

So inside this attrs parameter, we put in our key value pair of class and nicetable, and this helps find our table. Once that code is run, we now have this table object. But it’s not quite the table we are looking for.

If you don’t believe me, have a look! You know how to look at things by now, our handy dandy print function:

print(table)

You will see a whole bunch of tags in there, and it’s just not as neat and tidy as we would like in order to have it as our attribute table that we can join to a shapefile, and use in our mapping. So there’s one more step to cleaning this up. We achieve this by running the below code, to use these tags further to sort things into cells, rows, columns, all the things we are used to in our regular everyday tables.

list_of_headers = []  # define list for headers
for header in table.find_all('th'):  # iterate through table headers
    list_of_headers.append(header.get_text())  # extract text from html tag and add to list of headers

list_of_rows = []   # define list for rows
for row in table.find_all('tr'):  # iterate through table rows
    list_of_cells = []  # define list for cells
    for cell in row.find_all('td'):  # iterate through table cells
        list_of_cells.append(cell.get_text())  # extract text from html tag and add to list of cells
    list_of_rows.append(list_of_cells)  # when a row is finished, append its cells to the list of rows
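To see what these loops produce without hitting the web, here is the same logic run on a miniature, made-up table (the demo_ names are invented just for this illustration):

```python
from bs4 import BeautifulSoup

# A miniature, made-up version of the kind of table we are scraping
demo_html = """
<table class="nicetable">
  <tr><th>Council</th><th>Reports</th></tr>
  <tr><td>Aberdeen City Council</td><td>1200</td></tr>
  <tr><td>Cardiff Council</td><td>800</td></tr>
</table>
"""

demo_table = BeautifulSoup(demo_html, "html.parser").find('table', attrs={"class": "nicetable"})

# Same logic as above: headers come from <th> tags, cell values from <td> tags
demo_headers = [th.get_text() for th in demo_table.find_all('th')]
demo_rows = [[td.get_text() for td in tr.find_all('td')]
             for tr in demo_table.find_all('tr')]

print(demo_headers)   # ['Council', 'Reports']
print(demo_rows[1])   # ['Aberdeen City Council', '1200']
```

Note that the header row produces an empty list of cells (it has no <td> tags), which is why the first real data row is demo_rows[1].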

And once we have run this, we can then save our now normal everyday looking table into our workspace, which we specified at the start of this tutorial. But first let’s give it a name. You can do this by changing the part where it says your_filename_here to your desired filename in the code below:

my_file_name = "your_filename_here.csv"

So in this case, let’s be totally uncreative, and call this object output.csv. Note to your future selves: you probably always want to give some meaningful name to your data, so that when you come back to it some months later, you know what’s what, and don’t have a whole bunch of files (output1, output2, etc) which you are unsure about what even is in them!

my_file_name = "output.csv"

Great stuff. Now we can finally save our table:

outfile = open(my_file_name, "w")
writer = csv.writer(outfile, lineterminator="\n")
writer.writerow(list_of_headers)
writer.writerows(list_of_rows)
outfile.close()  # close the file so everything is written to disk

You should now see your file, output.csv, appear in your working directory folder. If you want, you can open this with something like Excel to have a look at your table. But you don’t really need to at this point; instead, what we will do is read it back into QGIS as an attribute table to be joined to a shapefile.

So you can do this the pointy-clicky way, as you would normally load a .csv file, or you can just copy and paste the below code into the Python Console, and that will load the table straight into your QGIS environment:

uri = "file:///" + path + "/" + my_file_name + "?type=csv&geomType=none"
layer = QgsVectorLayer(uri, "fms_table", "delimitedtext") # create layer from text file called fms_table
QgsMapLayerRegistry.instance().addMapLayer(layer) # add layer to QGIS

You should now see your new table appear in your QGIS layers.

So which local authorities have the most reports?

First: MORE data cleaning (ugh)

OK so now that we have this table we can link it to a shapefile of local authorities in the UK.

So you can do this in many ways; you could get the shapefiles manually, but the point of this session is to get everything from online as smoothly as possible. So for this we will use a boundary format called geojson, which is basically a smaller, text-based file format that facilitates quick download and sharing of spatial boundary files.

To load a geojson of UK local authorities, open up the dialogue to add a new vector layer in QGIS.

Then make sure that geojson is selected from the drop-down menu, and under URL, paste the following URL:

https://raw.githubusercontent.com/martinjc/UK-GeoJSON/master/json/administrative/gb/lad.json

Then click OK and the shapefile should load into your QGIS environment.

Now you have this shapefile of the UK local authorities. However, as is very common with “found” data, the names of variables are not always what you want them to be. In fact, when working with open data, a major chunk of your time will be spent on data cleaning and data wrangling, turning it into something acceptable for use. So to link our table of reports to this new shapefile, we need to make sure that their descriptions match. Open the attribute tables, and have a look at what columns you could use to match them up. You will notice that the text is slightly different in these columns. In order to join up the data, as you might have done in other tutorials in your course, you will need to make sure that everything corresponds.

So we can see that the main difference here is the inclusion of words like “Council”, “District Council”, and “Borough Council”. So one approach is to create a new column, where we remove these from the name.

So for this we use the field calculator. You can click on the little abacus icon to bring up the field calculator window for calculating a new column:

Once you have the window here you can enter any value to calculate a new column. I assume that you have used this in other tutorials, but if not, you can find the documentation for using the field calculator here, and a quick youtube demo here.

We want to create a new variable, so we give it a name. Because it’s a new version of Name that we are using to link the data to the map, we can call it “linkName”. And we want it to be a text variable, so we can select that as well:

Now we want to calculate a new field from the existing field called “Name”, but we want to remove the words that occur here but not in the geojson attribute table (“District Council” etc).

You can use the replace() function for this. If we only had to replace all instances of “District Council”, then that would look something like this:

replace("Name",'District Council','')

Here we say that we want to replace everything in the "Name" variable that matches the string ‘District Council’ with nothing (you can see there’s nothing between the second set of quotes).

However, we actually want to replace more things, not just “District Council”. We also need to replace all instances of “Borough Council” and “Council” so we need to combine these together into one:

replace(replace(replace("Name",' District Council','') , ' Borough Council',''), ' Council','')

If you copy and paste the above into the field calculator, then we create the new variable, named “linkName” as before, and it will be the same as “Name” except with all instances of the above strings removed.

Now if you click OK, this will result in a new column with these strings removed, and you can now use it to join the table of FixMyStreet reports to the geojson shapefile! So exciting!
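If you want to check the logic of that nested expression outside of QGIS, the same chained replacement can be sketched in plain Python (the council names below are just illustrative):

```python
# A few illustrative council names, similar to those in the FixMyStreet table
names = [
    "Aylesbury Vale District Council",
    "Barking and Dagenham Borough Council",
    "Aberdeen City Council",
]

# Apply the same three replacements as the field calculator expression,
# longest suffix first, so "District Council" is not left behind as "District"
cleaned = [
    n.replace(" District Council", "")
     .replace(" Borough Council", "")
     .replace(" Council", "")
    for n in names
]

print(cleaned)  # → ['Aylesbury Vale', 'Barking and Dagenham', 'Aberdeen City']
```

The order of the replacements matters: removing “ Council” first would leave “District” and “Borough” stranded in the names.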

NOTE: dirty data :(

So there is something to mention here. This way of getting rid of parts of the names that might not match the values in the geojson column is not very “clean”. We are guessing what the simplest operation might be to match them up. There are, of course, errors. For example, in the FixMyStreet data column we have “York City”, which in the geojson attribute table is just “York”. So this row will not join. There are many things you can do here: you could go through each line and pick the closest possible match by hand, or you could do this computationally, for example by creating some closeness scores, or by identifying words that sound like the target word using soundex. I do not have time in this tutorial to cover this, but it can be an interesting area to learn more about. If interested, I recommend a read of the Bad Data Handbook.
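As a taste of the computational approach, Python’s standard difflib module can suggest the closest candidate for each scraped name. A minimal sketch, with made-up name lists standing in for the two attribute tables:

```python
import difflib

# Hypothetical sample: names as they appear in the scraped FixMyStreet table...
scraped_names = ["York City", "Aberdeen City", "Glasgow"]
# ...and names as they appear in the geojson attribute table
geojson_names = ["York", "Aberdeen City", "Glasgow City", "Leeds"]

for name in scraped_names:
    # get_close_matches returns the most similar candidates, best first;
    # cutoff is a similarity threshold between 0 and 1
    matches = difflib.get_close_matches(name, geojson_names, n=1, cutoff=0.6)
    print(name, "->", matches[0] if matches else "no match")
```

This would pair “York City” with “York”, for example. In a real workflow you would still want to eyeball the suggested pairs before joining on them, since a fuzzy match can be confidently wrong.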

However here, for the sake of speed, we will just select the rows which match the rows in the target spatial layer.

Now: JOIN the data!

OK, so again I hope that joining tables to shapefiles is something which you have covered in some of your courses, but a very quick tutorial can be found here. The tutorial here should give you enough information to be able to follow along.

Double click on the shape layer, select “Joins”, and click on the green plus sign in the bottom left corner.

That should open the join dialogue table. For join layer you should select the filteredOutput layer (the one with all the FMS data in it), and for Join field, the column in that attribute table that matches a column in the destination shapefile’s attribute table. In this case, this is the ‘linkName’ column we created. Now select the column name in the attribute table of the geojson shapefile which is “LAD13NM”, for the Target field.

When you click okay it will take you back to the joins window, but now you should see the name of the joined layer appear here, with a little check mark next to it, indicating that the join was successful.

You can further verify this by opening up the attribute table of the geojson shapefile and seeing that the new columns have appeared. Some fields have NULL values; these are the non-matching rows. That’s fine! If we were doing this for a paper or a report, we would want to go back to the data cleaning step and tweak the data manually, to make sure that we retain the maximum amount of data. However, the goal here is to give you an overview, so we will move on to our final step in the data cleaning/joining process (finally!): making sure that all the columns are what we think they are!

So to do this, double click on the layer name, and this time select the “Fields” tab. You can see all your variables, and also what the QGIS environment thinks they are. If you scroll to the bottom, you can see that it thinks our counts of FixMyStreet reports are text values. This is not very useful if we want to do some calculations or manipulations with them as numeric values. So the last step is to convert between types. This is a handy and useful thing to know, as you will often have to convert between text and numeric data, especially with scraped data and open data from the “wild”.

You can actually open up the field calculator from this window, by clicking on the little abacus button again:

And this will bring up a similar dialogue. Now we will create a new numeric column, where we will enter the number of new complaints, based on the text column of new complaints joined from our output file. Let’s call this variable numNewReps, and make it an integer (whole number). Now we want to assign it the value from the column called filteredOutput_New problems. Conversions are very easy in the QGIS field calculator; we just need to use a function called toint(), which converts text…. you guessed it! TO INTegers.


 toint( "filteredOutput_New problems"  )
 

So you have to copy and paste the above code into the field calculator, and when you are satisfied you can just click OK.
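Why does the type matter so much? The same pitfall is easy to see in plain Python, with the built-in int() playing the role of toint() (the values here are made up):

```python
# As text, counts compare alphabetically, character by character
print("9" > "10")   # True -- "9" sorts after "1", which is misleading
print("100" + "5")  # "1005" -- concatenation, not addition

# Converted to integers, the same values behave like numbers
print(int("9") > int("10"))   # False
print(int("100") + int("5"))  # 105
```

This is exactly why a text column of report counts would sort and calculate incorrectly until we convert it.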

Your new column should now appear in the attribute table. YAY!

Now we can finally visualise our results.

Visualise the results

So all that is left is to actually present this data in a meaningful way, so we can begin to talk about it! To do this, double click on the geojson layer again, and this time select the “Style” tab.

From the dropdown menu on the top left, select “Graduated”:

Then, under “Column”, select the new numeric variable which we created. Change “Classes” to 3, and “Mode” to Quantile (Equal Counts). This way you will be splitting the councils for which we have data into three groups based on the number of new reports: low, medium, and high. There are of course many other ways to visualise this data, and depending on the conclusions you want the people who look at your map to draw, you will choose different approaches to grouping. However, that could be the topic of a whole new tutorial.

For now, follow these instructions, and when you are finished, click “OK”:

Now you will see you’ve produced a thematic map of new reports made to different councils using the FixMyStreet.com platform:

And you did this all yourself, getting the data, cleaning the data, and mapping the data, using QGIS and a little bit of python! Isn’t that exciting?

Now take a moment and look at the map, think about what it’s telling you, and think about what more you could do with this data. Maybe look around you in the lab, see where everyone else has gotten to, see if anyone has different interpretations.

You can now use this as a template for getting other tabular data from the web. As long as you can think of a spatial layer to which you would be able to join this table, you will be able to map such webscraped data using what you have learned here today. I hope that this is something useful you can take away with you for your work and your studies.

Twitter

Another source of crowdsourced data is Twitter. Twitter data is made freely available through its Application Programming Interfaces (APIs), which makes it one of the most popular open data sources for studies in the social sciences (Leetaru et al., 2013). Further, there is a lot of data constantly being generated; the Twitter service sees about 300 million tweets per day (Kamath et al., 2013). However, when attempting to map dynamic spatial or temporal fluctuations using tweets, the pool of usable data is somewhat reduced. Studies of the data normally find only about 1 to 2 per cent of tweets geocoded (Kamath et al., 2013; Leetaru et al., 2013).

Ranked as the 10th most popular site in the world by Alexa in January 2013, Twitter boasts 500 million registered users, who send 400 million tweets every day.

The only way to access 100% of those tweets in real-time is through the Twitter “Firehose”. The other option for accessing tweets is using one of Twitter’s direct API offerings.

The main option is Twitter’s Search API, which involves polling Twitter’s data through a search or username. The Search API gives you access to a data set that already exists, made of tweets that have already occurred. Through the Search API, users request tweets that match some sort of “search” criteria. The criteria can be keywords, usernames, locations, named places, etc. A good way to think of the Twitter Search API is to think of how an individual user would do a search directly at Twitter (navigating to search.twitter.com and entering keywords).

How much data can you get with the Twitter Search API? With the Search API, developers query (or poll) tweets that have already occurred, and are limited by Twitter’s rate limits. For an individual user, the maximum number of tweets you can receive is the last 3,200 tweets, regardless of the query criteria. With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. The Twitter request limits have changed over the years, but are currently 180 requests in a 15-minute period.
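To put those limits in perspective, here is a quick back-of-the-envelope calculation of how you would have to pace a long-running collection. Only the 180-requests-per-15-minutes figure comes from above; the rest is simple arithmetic, and the 100-tweets-per-request figure is just an illustrative assumption:

```python
# Twitter Search API budget: 180 requests per 15-minute window
requests_per_window = 180
window_seconds = 15 * 60

# Minimum delay between requests to stay inside the limit
delay = window_seconds / float(requests_per_window)
print(delay)  # → 5.0 seconds between requests

# Assuming, say, 100 tweets per request, the theoretical ceiling per window:
print(requests_per_window * 100)  # → 18000 tweets per 15 minutes
```

In other words, a script that fires off one request every five seconds will never hit the rate limit, which is a handy rule of thumb for longer collection runs.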

Anyone can sign up to access the twitter API, and that is what we will do now, to demonstrate getting some tweets on a map!

Signing up as a developer with Twitter

First, you will need to create a twitter account if you don’t already have one. You will have to visit twitter.com and sign up.

Then you go to apps.twitter.com/ and click on Create New App.

Fill out the details. You can name your app whatever you like (under ‘website’ you can put a placeholder, since you don’t have a site for your app yet; just make sure it’s a URL that doesn’t already exist!). Fill out the form and then click on ‘Create your twitter application’.

Now you should have some credentials, which you will need for getting some data from Twitter. You can see your credentials by clicking on the Keys and Access Tokens tab, circled in red below:

Getting some twitter data into QGIS

In this section we will collect some tweets through the Twitter Search API and get the geocoded ones into QGIS as a point layer: first using a ready-made plugin, and then by writing the code ourselves.

Lucky for us, someone has again written a plugin for getting some tweets into QGIS, so I will not make you write more code right now.

Instead, we will use the twitter2qgis package.

NOTE: you will need to have the tweepy module installed for this package to run properly. You do this the same way that you would have installed the other modules yourself (e.g. BeautifulSoup and csv, see above!). When you install the plugin, I think it gives you a temporary install of tweepy, but this won’t always work! So it is safer to have the module installed properly!

OK but for now, let’s progress. Bring up the dialogue window by clicking on Web > twitter2qgis > collect tweets:

It will open up a dialogue window, asking for your twitter details, which we got earlier by registering.

Fill this out with your access token etc information, which you acquired in the above section!

Then also specify a keyword you want to search for. Here I used “night tube” because I was wondering what people were tweeting about the (relatively) new 24-hour tube service in London. Specify the number of tweets you want to get back (for now let’s choose 10, so we don’t have to wait too long to see some results) and say that you want the file of the raw tweets. When all that’s done, and your form looks like the one below, hit “OK”.

Now you must be patient. As we mentioned, only about 1-2 per cent of tweets are geocoded, and this plugin will only take the geocoded ones. So it will have to cycle through quite a few tweets, before it finds you 10 geocoded ones, with your required key words. Patience is always key with these things. You will get some sort of indicator that QGIS is not responding (because it’s working) so you can sit tight for a bit. Check your (possibly new) twitter account!
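A rough back-of-the-envelope calculation shows why the wait can be long (the 1 per cent figure is the pessimistic end of the range quoted earlier):

```python
# If only about 1 to 2 per cent of tweets carry coordinates, roughly how
# many tweets does the plugin have to scan to find 10 geocoded ones?
wanted = 10

# pessimistic case: 1 geocoded tweet per 100 scanned
expected_scanned = wanted * 100
print(expected_scanned)  # → 1000 tweets scanned, on average
```

So even a small request for 10 geocoded tweets means churning through on the order of a thousand tweets behind the scenes, which is why QGIS appears to hang.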

Now once it is done, you will see a new window, asking you what coordinate system you would like to map your tweets in:

Here we will select WGS 84

You should now see a new layer appear in your layers list. Right click and view the attribute table to have a peek at the sorts of tweets we gathered. In this case, we actually see that we got a whole bunch of tweets talking about “night” but not related to the night tube! This is not exciting.

There are different things we can experiment with when defining a search term. You can see the guidelines around that here. We could search for “night tube” as an exact phrase, in quotation marks. Do keep in mind that the more specific your search query, the longer the search will take. twitter2qgis only returns geocoded tweets, so even if a regular search brings back more, they might not have coordinates. Again, in the interest of brevity, I will search for just 10 such tweets.

Doing this yourself

The package is experimental, and there are no guarantees on it. It may or may not work as intended. It has some mixed reviews, and to be honest, I have had some issues with it myself. So, since we’re here to learn how to do this ourselves, if you’re feeling more adventurous, let’s get back on that Python console and get some tweets for ourselves!

So the first thing we have to do is create some variables that hold your keys and access tokens. This is also useful because you only have to enter them once! With the twitter2qgis package, you need to enter them every time you use it!

So let’s start by creating these 4 variables:

CONSUMER_KEY = '...'
CONSUMER_SECRET = '...'
ACCESS_KEY = '...'
ACCESS_SECRET = '...'

Now we can use the tweepy package to retrieve some tweets from the Twitter API. You load this package the same way we loaded the previous ones, with an import statement:

import tweepy

Now we can use functions from within this package. Specifically, we will use functions that authenticate us with your keys and tokens, and a function for searching through tweets.

#first create an authentication handler using your consumer key and secret
auth = tweepy.auth.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
#and then you set your access tokens for this authentication
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
#and then you use the API function and pass the authentication credentials we've built to get access to the API
api = tweepy.API(auth)
#now that we are deemed OK to go ahead, we can pass a search query using the search() function. Inside we specify that we want tweets where the keyword "night tube" appears in the text. We also set the count parameter to 100, to get a decent sample of tweets back
search_results = api.search(q="night tube", count=100)

When this is finished, you have a new object called search_results that contains these 100 tweets. You can have a look at what this contains if you want. You can print the whole thing, or if you want, just print the first element1 by typing:

search_results[1]

Now you still want to parse this into something that we can open up easily in QGIS. You can also select which bits of data you want into columns for your data table. You can have a read through all the different fields and what information they contain on the Twitter developer pages here.
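For reference, the coordinates field of a tweet, when present, is a small GeoJSON dictionary. Here is a sketch with a made-up tweet (not real data) showing the shape of that field and how the point is pulled out of it, mirroring what our script below does with tweet.coordinates:

```python
# A made-up tweet, trimmed to the fields we care about
tweet = {
    "created_at": "Mon Aug 22 07:15:00 +0000 2016",
    "user": {"screen_name": "example_user"},
    "text": "Waiting for the night tube!",
    # GeoJSON point: note the order is [longitude, latitude]
    "coordinates": {"type": "Point", "coordinates": [-0.1278, 51.5074]},
}

coords = tweet["coordinates"]
if coords is not None:
    lon, lat = coords["coordinates"]
    print(lon, lat)  # → -0.1278 51.5074
```

The longitude-first ordering will matter again later, when we split the coordinates back into separate columns for mapping.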

Let’s say that I want the date and time of the tweet, the username of the person who has tweeted, the text of the tweet itself, and the coordinates, if there are any. To get these into a table, we can use the below code:

#make sure the csv module is loaded (we imported it earlier in this tutorial, but it does no harm to import it again)
import csv

#first create a csv file that will store your results. The first parameter I pass is the filename; I name mine twitterResult.csv. The second parameter I pass to the open() function is the mode, a string indicating how the file is to be opened. The most commonly used values of mode are 'r' for reading, 'w' for writing (truncating the file if it already exists), and 'a' for appending. If mode is omitted, it defaults to 'r'.
csvFile = open('twitterResult.csv', 'a')


#then we use csv writer to write a csv file
csvWriter = csv.writer(csvFile)

#and we write the first row with our header names:
csvWriter.writerow(["created_at", "screen_name", "text", "coordinates"])

#now we build what to write. we use a for loop to iterate through all the tweets in our results
for tweet in search_results:
    #we also want to get coordinates, however we can only extract coordinates from tweets where they exist in the first place. So here, we say that if the coordinates are not None (so they do exist), then we want them to be extracted from the coordinates dictionary. 
    if tweet.coordinates is not None: 
        coordinates = tweet.coordinates.get(u'coordinates')
    #if they are not not None (ha-ha) then we can just assign them to be "NA"
    else:
        coordinates = "NA"
        
    #then we use the writerow function to write for each tweet a new row that contains the elements of the tweets we've selected. Here these are the created_at, user.screen_name, and the actual text, as well as the coordinates that we created above.    
    csvWriter.writerow([tweet.created_at, tweet.user.screen_name, tweet.text.encode('utf-8'), coordinates])
    
    #we also print this result, so we can see our script working away in the Python Console window :)
    print tweet.created_at, tweet.user.screen_name, tweet.text, coordinates

#and finally close the file, when we've gotten through all the tweets
csvFile.close()

Now if you look in your working folder, you should see a new .csv file that has appeared, which contains your tweets! Exciting. Add this to your QGIS environment, either manually, or using the Python console again, as we added our fms_reports table:

uri = "file:///" + path + "/" + "twitterResult.csv" + "?type=csv&geomType=none"
layer = QgsVectorLayer(uri, "twitter_table", "delimitedtext") # create layer from text file called fms_table
QgsMapLayerRegistry.instance().addMapLayer(layer) # add layer to QGIS

Now you can have a look at the attribute table:

You can see that not many are geocoded. In fact, in my example there are only two. Normally, if you want to get geocoded tweets, you will have to leave the search running over longer periods of time, retrieving more tweets. As discussed, only about 1-2 percent of them are geocoded, so with our 100 tweets, this is about as many as we’d expect!

Nevertheless, let’s create a new layer from only those tweets which have been geocoded. For this, use the “select features using an expression” button in the attribute table:

It will bring up a dialogue similar to the field calculator. Here we need to specify an expression; the rows for which it returns true are the rows which will be selected. You essentially want to select the rows where the coordinates are not NA. You can do this with the following expression:

 "coordinates"  IS NOT 'NA'

After this, click on the select button in the bottom right hand corner (next to ‘Close’).

Now you will see that the rows with coordinates are highlighted.
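The selection can also be sketched in plain Python, filtering made-up rows the same way the expression does (the two rows below are hypothetical, in the shape of our twitterResult.csv table):

```python
# Two made-up rows: created_at, screen_name, text, coordinates
rows = [
    ["Mon Aug 22 07:15:00 +0000 2016", "user_a", "night tube!", "[-0.13, 51.5]"],
    ["Mon Aug 22 07:16:00 +0000 2016", "user_b", "good night", "NA"],
]

# Keep only the rows whose coordinates column is not "NA",
# just like the selection expression in the attribute table
selected = [row for row in rows if row[3] != "NA"]
print(len(selected))  # → 1
```

Only the first row survives the filter, because it is the only one with real coordinates.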

Go back to your layer, right click, and select ‘Save as…’

When you are saving this new layer, make sure that you tick the box for ‘Save only selected’:

You now have your (tiny) table of geocoded tweets.

So, as you can see, the coordinates are not in separate latitude and longitude columns. You guessed it: we’ve got a bit more data cleaning to do before we can put these tweets on a map! But we will only be using the field calculator now.

First we want to remove the useless characters (the square brackets) and then we want to split this coordinates column, and assign the first value to a longitude and the second to a latitude column.

We’ve already done some removing, using the replace() function, so you should essentially be able to write this bit yourself! But as a refresher, I’ll show you: I’m creating a new text column, calling it coords2, and using the following code to replace the characters I don’t need with nothing:

replace(replace("coordinates",'[',''),']','')

When done, click OK

Now, to get the longitude, you want the numbers on the left of the column, and you can get that with this equation:

 toreal( left("coords2", strpos("coords2", ',')))

And to get the latitude you need the numbers on the right, which you get with this equation:

 toreal(right("coords2", length("coords2")-(strpos("coords2", ',')+1)))

Use these in the field calculator to create latitude and longitude columns! You’ll notice that we are also wrapping the equations in the toreal() function, because we want them to be numbers (as coordinates are) rather than text.

You should also make sure that the columns you’re creating are Decimal number (real) columns, and that you leave enough character spaces for the full coordinates!
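If you want to sanity-check those two field calculator expressions, here is the same split done in plain Python on a made-up coordinates string (remember the GeoJSON order is longitude first):

```python
# A made-up value from the coordinates column, as written by our csv script
raw = "[-97.7431, 30.2672]"

# Strip the square brackets, as replace() did in the field calculator
cleaned = raw.replace("[", "").replace("]", "")

# Split on the comma and convert each half to a real number,
# playing the role of left()/right() with toreal()
lon_text, lat_text = cleaned.split(",")
longitude = float(lon_text)
latitude = float(lat_text)

print(longitude, latitude)  # → -97.7431 30.2672
```

Note that float() happily ignores the leading space after the comma, whereas in the field calculator we had to count characters carefully with strpos().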

Now finally we are ready to map our tweets! Save this layer as a csv

And then read it back in, as a delimited file, but this time WITH GEOMETRY. YAY!

And finally, there is our one tweet:

I’ve added in an OpenStreetMap backdrop just to see where it is. Aaaand it’s in Texas. That’s different.

Well that’s all folks, do feel free to play around with this, get some tweets about a topic you might be particularly interested in, or look into restricting your twitter queries spatially. To do that, you can add another parameter to the original search query, in the Python code.

Do you remember where we specified that we were looking for the term “night tube”?

The part of the code that looked like this:

search_results = api.search(q="night tube", count=100)

Well, you can add all sorts of further parameters to that search request. You can see some here (hint: search for geocode under ‘Help Methods’).

So now, if you wanted to restrict the search to tweets within 500 km of London, you can get some coordinates for central London, and go from there:

search_results = api.search(q="night tube",  geocode='51.507351,-0.127758,500km', count=100)

Re-run your search with these specifications, and see what happens!
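Note that the geocode parameter is a single string in latitude,longitude,radius order. If you are building it from numeric variables, a small sketch (using the same central London coordinates as above):

```python
# Coordinates for central London and a search radius
lat, lon, radius_km = 51.507351, -0.127758, 500

# The geocode parameter is one string: "latitude,longitude,radiuskm"
geocode = "%.6f,%.6f,%dkm" % (lat, lon, radius_km)
print(geocode)  # → 51.507351,-0.127758,500km
```

Building the string this way makes it easy to loop over several cities, or to experiment with different radii, without retyping the whole query each time.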

Final words

So I hope that this tutorial has given you a taste for the wealth of open data that you can gain access to, through the tools that you have available to you, for free! With a bit of practice (and a lot of googling) you should be able to tap into these sources of data, and use them in your work and research. Do keep in mind however all that we discussed during lecture, both around the strengths of these data, and the limitations. Especially when scraping data, consider copyright, privacy, and other concerns around ethical use of these data, as well as the implicit biases inherent in their mode of production. But that said, do take advantage of such data as well, and don’t let the limitations stop you from exploring what new insights they can provide into the topics of your research interests.


  1. Actually this isn’t really the first element but the second, as indexing begins with 0, so search_results[0] would be the first element, but there is absolutely no need to discuss this here as part of this tutorial. I just didn’t want to lie to you.